At first, we wanted to analyse the impacts of the recent global pandemic on individuals within the context of a variety of mental health conditions, namely anxiety and depression, in a pre- and post-COVID framework. However, we decided to drop the idea due to a lack of data available regarding the topic, which wouldn’t have allowed us to conduct a thorough analysis. As we were searching for new topic ideas, we found a dataset that concerns shark attacks around the world. A group member has then told us about an incident that occurred during his last holidays in Egypt, involving a shark attack at a paradisiac beach. We discussed thereafter the presence of these predators in many different regions in the world including Florida, Hawaii, and Australia just to name a few. For this reason, we decided to dive deeper into the topic of shark attacks and explore potential influencing factors. After a few days of research and information gathering, we found that a wide range of factors are capable of influencing such incidents, both directly and indirectly. We all enjoy going on holidays and enjoy beaches while drinking a cocktail, and when we feel that temperature increases, we go in the water to refresh, but depending on the location we are, this could be unsafe.
Another motivation behind this project is our interest in the preservation of marine fauna. Very often human beings tend to primarily blame the animals for their fierce behavior (just look at the example of the attack that occurred in Hurghada during summer of 2023: after the death of the victim, the shark was killed!) but we often forget that the water it is their natural habitat, and that we are the guests. This point motivates us even more to study the phenomenon of attacks and to disclose how human activity causes shark attacks.
The project’s focus is to provide an understanding of the complex dynamics of shark attacks globally, by conducting an in-depth analysis on the different factors that influence the phenomenon. In addressing the topic at hand, we explore both direct and indirect influencing factors.
At the end of our work we want to learn how it is possible to reduce accidents caused by sharks. On the one hand, this is crucial to guarantee the safety of people who take pleasure in coastal waters and engage in water-related activities. On the other hand, the conservation of aquatic animals might benefit from this. Making efforts to preserve these species is essential for sustaining the health of marine ecosystems in a time when the extinction rate of many animal species is rising. Among the factors that might have an impact on the trend of shark attacks we have climatic variations. Ocean temperatures and currents are capable of affecting migration patterns and habitat preferences, which further complicates the dynamics of shark attacks. By providing a step-by-step analysis on climate-related aspects, we aim to provide empirical evidence on the impact of environmental changes on shark movements and interactions with humans.
This project’s goal is to contribute to the understanding of shark-related incidents and thereby to provide possible strategies such as policymaking, which would help mitigate such occurrences, ultimately safeguarding human life and protecting sharks. In essence, we aim to advance research on sustainable coexistence with the marine environment through actionable insights and intend to contribute to the broader efforts of marine conservation.
In the second section of our work, we will present the datasets that are necessary to answer our research questions. These datasets serve as the foundation for our investigation and provide the data required to draw insightful and important findings. The purpose of this part is to provide a clear overview of the data landscape and make clear to the reader what are the information to be found in each web-based data, and how do they interrelate.
Our second dataset contains the average temperature change for each country (annually, monthly and seasonally) from 1961 to 2022. It contains 9,657 observations.
The dataset was taken from the website of the Food and Agriculture Organization of United Nations: https://www.fao.org/faostat/en/#data/ET
# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable
Variables <- c("Area Code", "Area", "Months ", "Month code ", "Element Code ", "Element", "Unit", "Years")
Description <- c("Code for each country", "Country for which the data has been collected", "Month during which the data on temperature has been collected", "Code of each month ", "Binary variable (7271: Temperature change; 6078: Standard Deviation)", "What is the data measuring (either temperature change or standard deviation of data collection)", "Unit of measure of temperature ", "Year of observations" )
Table2 <- data.frame(Variable = Variables, Description = Description)
# The knitr::kable() allows to visualize the table
knitr::kable(Table2)
| Variable | Description |
|---|---|
| Area Code | Code for each country |
| Area | Country for which the data has been collected |
| Months | Month during which the data on temperature has been collected |
| Month code | Code of each month |
| Element Code | Binary variable (7271: Temperature change; 6078: Standard Deviation) |
| Element | What is the data measuring (either temperature change or standard deviation of data collection) |
| Unit | Unit of measure of temperature |
| Years | Year of observations |
Our third dataset contains information on the global increase of sea level from 1993 to 2021. The following acronims are important to be noted: GMLS:Global mean sea level GIA: Global Isostatic Adjustment, which is the ongoing movement of land once burdened by ice-age glaciers (National Ocean Service, n.d.)
The dataset was taken from the website of the NASA: https://climate.nasa.gov/vital-signs/sea-level/
# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable
Variables <- c("Year", "TotalWeightedObservations ", "GMSL_noGIA ", "StdDevGMSL_noGIA ", "SmoothedGSML_noGIA ", "GMSL_GIA", "StdDevGMSL_GIA", "SmoothedGSML_GIA", "SmoothedGSML_GIA_sigremoved")
Description <- c("Year of the observation", "Total Weighted Observations of Sea Level", "Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (without considering the GIA)", "Standard Deviation of global mean sea level variation estimate in millimeters (without considering the GIA)", "Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (without considering the GIA)", "Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (considering the GIA)", "Standard Deviation of global mean sea level variation estimate in millimeters (considering the GIA)", "Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (considering the GIA)", "Smoothed (60-day Gaussian type filter) Global Mean Sea Level variation in millimeters (considering the GIA) ; annual and semi-annual signal removed ")
Table3 <- data.frame(Variable = Variables, Description = Description)
# The knitr::kable() allows to visualize the table
knitr::kable(Table3)
| Variable | Description |
|---|---|
| Year | Year of the observation |
| TotalWeightedObservations | Total Weighted Observations of Sea Level |
| GMSL_noGIA | Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (without considering the GIA) |
| StdDevGMSL_noGIA | Standard Deviation of global mean sea level variation estimate in millimeters (without considering the GIA) |
| SmoothedGSML_noGIA | Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (without considering the GIA) |
| GMSL_GIA | Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (considering the GIA) |
| StdDevGMSL_GIA | Standard Deviation of global mean sea level variation estimate in millimeters (considering the GIA) |
| SmoothedGSML_GIA | Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (considering the GIA) |
| SmoothedGSML_GIA_sigremoved | Smoothed (60-day Gaussian type filter) Global Mean Sea Level variation in millimeters (considering the GIA) ; annual and semi-annual signal removed |
Our fourth dataset relates to GHG emissions, one of the main concerns linked to global warming. The dataset contains 31.315 observations on annual GHG emissions per country and per continent, but we will focus on the first case only.
The dataset was taken from the website Our World in Data: https://ourworldindata.org/annual-co2-emissions
# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable
Variables <- c("Entity", "Code", "Year", "Annual COâ‚‚ emissions")
Description <- c("Country examinated", "ISO Code", "Year of the observation", "GHG emissions per year")
Table4 <- data.frame(Variable = Variables, Description = Description)
# The knitr::kable() allows to visualize the table
knitr::kable(Table4)
| Variable | Description |
|---|---|
| Entity | Country examinated |
| Code | ISO Code |
| Year | Year of the observation |
| Annual COâ‚‚ emissions | GHG emissions per year |
The first step of our work is data cleaning. While collecting data for a study, it might happen to have some variables with missing information, worthless data or spelling mistakes In order to run a regression, it is fundamental that each column within a data set has no missing values and that information is written following the same format. This procedure takes the name of data cleaning, and it will cover the whole subsection.
Please be aware that at the beginning of each subsection we only provide general instructions of our cleaning. Through the code, you will get additional in-depth explanations of each step we took.
The first dataset we will examine is the one pertaining to shark attacks. It was quite full of misspellings and missing values, which required a lot of time and a very dense code to be able to clean it.
For our work, we have decided to only study the phenomenon of shark attacks from 1970 forward. In fact, it is possible to observe in the raw dataset that, before to this date, observations had more erroneous information. This leads one to believe that, more likely than not, the information we needed is more accurate if we do not travel too far back in time.
While cleaning the column of Ages, we realized that NA were too much and we could not delete them all, because that would have meant losing 29,5% of our data. Therefore, we have decided to work in the following way: for each country that presents LESS NA than the total observation, we compute an histogram (Age against frequency) of non-missing data. If the distribution is normal, we substitute the NA with the mean. If it is left skewed, we substitute the NAs with the first quantile, if it is right skewed, we substitute it with the third quantile.
Here is an example with USA and Bahamas. For USA, the distribution of the non-missing victims’ ages is skewed on the left. Therefore we have decided to substitute all the missing values with the first quantile. For Bahamas, it is evident that data follow a normal distribution, therefore we replace NAs with the mean.
#EGYPT#
ages_in_egy <- attacks %>%
filter(Country == "EGYPT") %>%
select(Age)
ages_vector <- ages_in_egy$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_egy = mean(ages_without_na)
attacks$Age[attacks$Country == "EGYPT" & is.na(attacks$Age)] <- mean_egy
#ITALY#
# Extract all ages from individuals who come from Italy
ages_in_italy <- attacks %>%
filter(Country == "ITALY") %>%
select(Age)
# Convert the extracted ages to a vector
ages_vector <- ages_in_italy$Age
# Display or use the 'ages_vector' containing ages from individuals in Italy
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_age = mean(ages_without_na)
# Replace all the NA on Italy with the mean of people attacked in Italy.
attacks$Age[attacks$Country == "ITALY" & is.na(attacks$Age)] <- 37.375
#AUSTRALIA#
ages_in_aus <- attacks %>%
filter(Country == "AUSTRALIA") %>%
select(Age)
ages_vector <- ages_in_aus$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_aus = mean(ages_without_na)
attacks$Age[attacks$Country == "AUSTRALIA" & is.na(attacks$Age)] <- mean_aus
#SOUTH AFRICA#
ages_in_SA <- attacks %>%
filter(Country == "SOUTH AFRICA") %>%
select(Age)
ages_vector <- ages_in_SA$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_SA = mean(ages_without_na)
attacks$Age[attacks$Country == "SOUTH AFRICA" & is.na(attacks$Age)] <- mean_SA
#BRAZIL#
ages_in_BRA <- attacks %>%
filter(Country == "BRAZIL") %>%
select(Age)
ages_vector <- ages_in_BRA$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_BRA = mean(ages_without_na)
attacks$Age[attacks$Country == "BRAZIL" & is.na(attacks$Age)] <- mean_BRA
#FIJI#
ages <- attacks %>%
filter(Country == "FIJI") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "FIJI" & is.na(attacks$Age)] <- mean
#FRENCH POLYNESIA#
ages <- attacks %>%
filter(Country == "FRENCH POLYNESIA") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "FRENCH POLYNESIA" & is.na(attacks$Age)] <- mean
#HONG KONG#
ages <- attacks %>%
filter(Country == "HONG KONG") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "HONG KONG" & is.na(attacks$Age)] <- mean
#JAPAN#
ages <- attacks %>%
filter(Country == "JAPAN") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "JAPAN" & is.na(attacks$Age)] <- mean
#MEXICO#
ages <- attacks %>%
filter(Country == "MEXICO") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "MEXICO" & is.na(attacks$Age)] <- mean
#MOZAMBIQUE#
ages <- attacks %>%
filter(Country == "MOZAMBIQUE") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "MOZAMBIQUE" & is.na(attacks$Age)] <- mean
#NEW ZELAND#
ages <- attacks %>%
filter(Country == "NEW ZEALAND") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "NEW ZEALAND" & is.na(attacks$Age)] <- mean
#PHILIPPINES#
ages <- attacks %>%
filter(Country == "PHILIPPINES") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "PHILIPPINES" & is.na(attacks$Age)] <- mean
#REUNION#
ages <- attacks %>%
filter(Country == "REUNION") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "REUNION" & is.na(attacks$Age)] <- mean
#NEW CALEDONIA#
ages <- attacks %>%
filter(Country == "NEW CALEDONIA") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "NEW CALEDONIA" & is.na(attacks$Age)] <- mean
#We fixed those countries that had lots of NA. Now, for the remaining NA we decided not to do anything for a specific reason. Not only we have not cound the real information on the internet, but we also believe that doing the mean for those remaining situations was useless, because these are the cases when NA are either the same amount of total shark attacks (see Antigua with 1 shark attack and 1 NA), or NA are more than half of the total attacks (see Malaysia with 4 total attacks but 3 of them are NA).
count_na_by_country <- attacks %>%
group_by(Country) %>%
summarise(NA_count = sum(is.na(Age)))
attacks <- subset(attacks, !is.na(Age))
#___________________________________________________________________________________________________________________
#NOW WORK ON TIME
#sometimes hours are written in a different format compared to most of the others, which follow the format
#13h30. Since they are not a lot we just replaced them manually
attacks<- mutate(attacks, Time= ifelse(Time=="Possibly same incident as 2000.08.21", "", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="14h00 -15h00", "14h30", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="14h30 / 15h30", "15h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="09h30 / 10h00", "9h45", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="10h45-11h15", "11h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="Sometime between 06h00 & 08hoo", "7h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="18h15-18h30", "18h20", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="09h00 - 09h30", "9h15", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="10h00 -- 11h00", "10h30", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="11h115", "11h15", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "Between 05h00 and 08h00", "6h30", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "17h00 or 17h40", "17h20", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "13h345", "13h45", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "<a0> ", "", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "09h00 -10h00", "9h30", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "2 hours after Opperman", "", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "11h00 / 11h30", "11h15", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "Between 06h00 & 07h20", "6h40", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "30 minutes after 1992.07.08.a", "", Time))
#now, our objective is to classify hours in parts of the day. indeed, it is useless to keep hours as they are
#for a regression because we want times like 7h30 and 7h00 OR 15h00 and 16h00 to be read as the same thing.
# some of the rows already had the part of the day in it, but in order to work easily with all the column, we
#replaced "morning" with 8h00 etc. in this way, we're able to remove all the strings that are useless. finally,
#void rows have been replaced by an NA and hours, which were there as CHAR have been replaced by numeric
attacks$Time <- str_replace_all(attacks$Time, "\\bmorning\\b", "8h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bafternoon\\b", "15h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bevening\\b", "20h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bnight\\b", "23h00")
attacks$Time <- gsub("[^[:digit:]]", "", attacks$Time)
attacks$Time = na_if(attacks$Time, "")
attacks$Time <- as.numeric(attacks$Time)
# since we removed all the letters, hours now are not written as 8h00 or 15h30 but as 800 and 15h00. this is
# great for us, so that we can easily create a function that classifies those numbers as parts of the day. the
#function works this way: if a value is included between 0 and 500 (i.e. midnight and 5a.m.), the value is replaced
#by the word "night" etc.
timeoftheday <- function(time) {
if (!is.na(time)) {
if (time >= 0 && time < 500) {
return("night")
} else if (time >= 500 && time < 1200) {
return("morning")
} else if (time >= 1200 && time < 1730) {
return("afternoon")
} else if (time >= 1730 && time < 2400) {
return("evening")
} else {
return("") # Handle any other cases (if necessary)
}
} else {
return(NA) # Retain NA values
}
}
attacks$Time <- sapply(attacks$Time, timeoftheday)
attacks$Time <- tolower(attacks$Time)
attacks$Time <- na_if(attacks$Time, "")
sum(is.na(attacks$Time)) #this is the only one that still presents 1415 NA. we dont want to delete
#them all coz we'd lose 44% of our data.
table(attacks$Time)#table shows that most of them happen in the afternoon, while 2ns place is owned by
#morning. To confirm the higher frequency of attacks between 8am and 6pm is this artile (link JC found??)
#we explain this by the fact that, naturally, most of people swim during the day. therefore, what we do
#is replacing NA randomly by the proportion of afternoon, morning and evening.
sum(!is.na(attacks$Time))
#create function that substitutes the NA in time with one of the 4 parts of the day, based on their
#proportion presence in our dataset
non_na_time_proportions <- table(attacks$Time) / sum(!is.na(attacks$Time))
# Identify NA positions
position_of_na <- is.na(attacks$Time)
# Generate random values based on proportions
attacks$Time[position_of_na] <- sample(
names(non_na_time_proportions),
sum(position_of_na),
replace = TRUE,
prob = as.vector(non_na_time_proportions)
)
#_____________________________________________________________________________________________
#NOW WE WORK ON SEX
attacks$Sex <- gsub("lli", "", attacks$Sex)
attacks$Sex <- gsub("M ", "M", attacks$Sex)
attacks$Sex <- na_if(attacks$Sex, "")
attacks$Sex <- ifelse(is.na(attacks$Sex), "Unknown", attacks$Sex)
#__________________________________________________________________________________________________________________
#NOW WE WORK ON SHARK SPECIES
#Creation of a variable species where we delete all the numbers that concern the size of the shark
attacks$Species <- str_remove_all(attacks$Species, "[[:digit:]]")
#If there is an empty cell, we will put NA
species = na_if(attacks$Species, "")
#gsub to take off punctuation
species <- gsub("[[:punct:]]", "", species)
#Delete all the numbers followed by cm or m
species <- gsub("\\d+\\s*(cm|m)\\b", "", species)
#Delete word composed by 1 or 2 letters
species <- gsub("\\b\\w{1,2}\\b", "", species)
#Delete all the numbers
species <- gsub("\\d", "", species)
#Delete all unities as lb, kg or l
species <- gsub("\\b(kg|lb|b)\\b", "", species)
#If the word shark appears more than once in each cell, we only write shark once
species <- gsub("\\bshark\\b(.*\\bshark\\b)?", "shark", species, ignore.case = TRUE)
# Rewrite shark species names in one way, and do that for each species
species<- gsub("^.*White shark.*$", "White shark", species)
species<- gsub("^.*whitetip shark.*$", "White shark", species)
species<- gsub("^.*white shark.*$", "White shark", species)
species<- gsub("^.*Wobbegong shark.*$", "Wobbegong shark", species)
species<- gsub("^.*Wobbegong.*$", "Wobbegong shark", species)
species<- gsub("^.*Zambesi shark.*$", "Zambesi shark", species)
species<- gsub("^.*Zambezi shark.*$", "Zambesi shark", species)
species<- gsub("^.*whaler shark.*$", "Whale shark", species)
species<- gsub("^.*whale shark.*$", "Whale shark", species)
species<- gsub("^.*tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*Tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*thresher shark.*$", "Thresher shark", species)
species<- gsub("^.*spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark feet.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*spinner blacktip shark.*$", "Spinner shark", species)
species<- gsub("^.*Tawny nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Mako shark.*$", "Mako shark", species)
species<- gsub("^.*mako shark.*$", "Mako shark", species)
species<- gsub("^.*Lemon shark.*$", "Lemon shark", species)
species<- gsub(".*lemon shark.*", "Lemon shark", species, ignore.case = TRUE)
species<-gsub("^\\s*shark\\s*$", "Unidentified shark", species)
species<- gsub(".*bull.*", "Bull shark", species, ignore.case = TRUE)
species<- gsub(".*blue.*", "Blue shark", species, ignore.case = TRUE)
species<- gsub(".*reef.*", "Reef shark", species, ignore.case = TRUE)
species<- gsub(".*sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("^.*Sand shark.*$", "Sand shark", species)
species<- gsub(".*Sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*sandshark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*Sandbar shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("juvenile\\s+\\w+", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark blacktip shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("blacktip\\s+\\w+", "Blacktip shark",species, ignore.case = TRUE)
species <- gsub("\\black\\w*\\b", "Blacktip shark", species, ignore.case = TRUE)
#Replace all occurrences of the word "blacktip" in a column with "Blacktip Shark," even if the word is accompanied by other words before or after
species <- gsub("\\bblacktip\\b", "Blacktip Shark", species, ignore.case = TRUE)
species<- gsub(".*copper shark*", "Copper shark", species, ignore.case = TRUE)
species <- gsub("\\bcow\\b", "Cow", species)
species <- gsub("\\bsilky\\b", "Silky", species)
species <- gsub("\\bsilvertip\\b", "Silvertip", species)
#Condition: if "Hammerhead" followed by other words then replaced by Hammerhead shark
species <- ifelse(grepl("Hammerhead\\s+\\w+", species, ignore.case = TRUE), "Hammerhead shark", species)
species <- ifelse(grepl("Blacktip\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Raggedtooth\\s+\\w+", species, ignore.case = TRUE), "Raggedtooth shark", species)
species <- ifelse(grepl("Porbeagle\\s+\\w+", species, ignore.case = TRUE), "Porbeagle shark", species)
#if the word shark does not appear then NA
species <- ifelse(grepl("shark", species, ignore.case = TRUE), species, NA)
species <- ifelse(grepl("Juvenile\\s+\\w+", species, ignore.case = TRUE), "Juvenile shark", species)
species <- ifelse(grepl("grey\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("greycolored\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("gray\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("Broadnose\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("7gill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("sevengill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("black\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Sand\\s+\\w+", species, ignore.case = TRUE), "Sand shark", species)
species <- ifelse(grepl("Carpet\\s+\\w+", species, ignore.case = TRUE), "Carpet shark", species)
species <- ifelse(grepl("\\bbrown\\b", species, ignore.case = TRUE), "Brown shark", species)
species <- gsub("^frag\\w+", "Unidentified shark", species)
species <- gsub("^shark$", "Unidentified shark", species)
#Function to check if "small" is present in cell
contains_small <- function(text) {
return(grepl("small", text))
}
# Replace cell with "unidentified shark" if "small" is present
species <- ifelse(sapply(species, contains_small), "Unidentified shark", species)
#Function to check if "Small" is present in cell
contains_small <- function(text) {
return(grepl("Small", text))
}
# Replace cell with "unidentified shark" if "Small" is present
species <- ifelse(sapply(species, contains_small), "Unidentified shark", species)
#Function to check if "sharks" is present in cell
contains_sharks <- function(text) {
return(grepl("sharks", text))
}
# Replace cell with"unidentified shark" if "sharks" is present
species <- ifelse(sapply(species, contains_sharks), "Unidentified shark", species)
#Function to check if "cookie" is present in cell
contains_cookie <- function(text) {
return(grepl("cookie", text))
}
# Replace cell with"Cookiecutter shark" if "cookie" is present
species <- ifelse(sapply(species, contains_cookie), "Cookiecutter shark", species)
#Function to check if "involvement" is present in cell
contains_involvement <- function(text) {
return(grepl("involvement", text))
}
# Replace cell with"No shark" if "involvement" is present
species <- ifelse(sapply(species, contains_involvement), "No shark", species)
#Function to check if "invovlement" is present in cell
contains_invovlement <- function(text) {
return(grepl("invovlement", text))
}
# Replace cell with"No shark" if "invovlement" is present
species <- ifelse(sapply(species, contains_invovlement), "No shark", species)
#Function to check if "Not" is present in cell
contains_Not <- function(text) {
return(grepl("Not", text))
}
# Replace cell with"No shark" if "Not" is present
species <- ifelse(sapply(species, contains_Not), "No shark", species)
#Function to check if "not" is present in cell
contains_not <- function(text) {
return(grepl("not", text))
}
# Replace cell with"No shark" if "not" is present
species <- ifelse(sapply(species, contains_not), "No shark", species)
#Function to check if "Questionable" is present in cell
contains_Questionable <- function(text) {
return(grepl("Questionable", text))
}
# Replace cell with"No shark" if "Questionable" is present
species <- ifelse(sapply(species, contains_Questionable), "No shark", species)
#Function to check if "questionable" is present in cell
contains_questionable <- function(text) {
return(grepl("questionable", text))
}
# Replace cell with"No shark" if "questionable" is present
species <- ifelse(sapply(species, contains_questionable), "No shark", species)
#Function to check if "Salmon" is present in cell
contains_Salmon <- function(text) {
return(grepl("Salmon", text))
}
# Replace cell with"Salmon shark" if "Salmon" is present
species <- ifelse(sapply(species, contains_Salmon), "Salmon shark", species)
#Function to check if "whaler" is present in cell
contains_whaler <- function(text) {
return(grepl("whaler", text))
}
# Replace cell with"Whale shark" if "involvement" is present
species <- ifelse(sapply(species, contains_whaler), "Whale shark", species)
# Replace the cell with "Unidentified shark" if any of the specified words are found
species <- ifelse(grepl("(seen|observed|Tooth|tooth|large|killed|captive|female|metre|foot|followed|caused|Said|young|probably|gaffed)", species, ignore.case = TRUE), "Unidentified shark", species)
# Replace the cell with "No shark" if any of the specified words are found
species <- ifelse(grepl("(hoax|No Shark)", species, ignore.case = TRUE), "No shark", species)
# Replace the cell with "Copper shark" if any of the specified words are found
species <- ifelse(grepl("(Copper)", species, ignore.case = TRUE), "Copper shark", species)
# Replace the cell with "Dogfish shark" if any of the specified words are found
species <- ifelse(grepl("(Dog|dogfish)", species, ignore.case = TRUE), "Dogfish shark", species)
# Replace the cell with "Dusky shark" if any of the specified words are found
species <- ifelse(grepl("(Dusky)", species, ignore.case = TRUE), "Dusky shark", species)
# Replace the cell with "Sevengill shark" if any of the specified words are found
species <- ifelse(grepl("(gill)", species, ignore.case = TRUE), "Sevengill shark", species)
# Replace the cell with "Angel shark" if any of the specified words are found
species <- ifelse(grepl("(Angel)", species, ignore.case = TRUE), "Angel shark", species)
attacks$Species <- species
attacks$Species <- ifelse(is.na(attacks$Species), "Unidentifies shark", attacks$Species)
#_____________________________________________________________________________________________
#ACTIVITY
#when trying to put everything in lower cap, this was the result: Errore in tolower(attacks$Activity) : multibyte string 921 not valid --> therefore, we converted everything in ASCII (=American Standard Code for Information Interchange)
attacks$Activity <- iconv(attacks$Activity, to = "ASCII", sub = " ")
attacks$Activity <- gsub("[^ -~]", "", attacks$Activity)
attacks$Activity <- tolower(attacks$Activity)
#when running a table of all activities, we can see that we can group them in some categories: diving, race, windsurfing, walking, wading, wade fishing, touching, swimming, surfing, surf, standing, spearfishing, snorkeling, scuba diving, playing, paddle, murder, kayaking, kayak, floating, fishing, so that we have standard categories.
attacks$Activity <- gsub("[^A-Za-z ]", "", attacks$Activity)
kept_activities <- c("diving", "race", "windsurfing", "walking", "wading", "wade fishing", "touching", "swimming", "surfing", "surf", "standing", "spearfishing", "snorkeling", "scuba diving", "playing", "paddle", "murder", "kayaking", "kayak", "floating", "fishing")
# We create an expression pattern that matches any of the specific words
pattern <- paste(kept_activities, collapse = "|")
# Extract only the specific words (kept_activities) from the column and replace the rest with an empty string
attacks$Activity <- str_extract(attacks$Activity, paste0("\\b(?:", pattern, ")\\b"))
attacks <- attacks %>%
mutate(Activity = str_replace_all(Activity, "\\bkayaking\\b", "kayak")) %>%
mutate(Activity = str_replace_all(Activity, "\\bsurfing\\b", "surf"))
attacks$Activity[is.na(attacks$Activity)] <- "other" # for all activities that are not part of the list or for NA, we just put Other, since no information is available
#_____________________________________________________________________________________________
#FATAL
#When we run a table for the Fatality column, we see that almost observation follow the same category (Y = yes, N = No, Unknwn). An observation has a 2017, which is probably a typo (which we replace by empty case) while another one has an M. We want to believe that it was a typo as well, and that "M" was typed instead of "N".
attacks$Fatal..Y.N. <- gsub("2017", "", attacks$Fatal..Y.N.)
attacks$Fatal..Y.N. <- gsub("M", "N", attacks$Fatal..Y.N.)
attacks$Fatal..Y.N. <-na_if(attacks$Fatal..Y.N., "")
#For the rest of NA, we have no information, so we took of these few observations.
attacks <- subset(attacks, !is.na(Fatal..Y.N.))
#Replace Yes by 1, 0 otherwise
attacks$Fatal..Y.N. <- ifelse(attacks$Fatal..Y.N. == "Y", 1, 0)
names(attacks)[names(attacks) == "Fatal..Y.N."] <- "Fatality"
#_____________________________________________________________________________________________
#final check up:
#We run a table for all columns, in order to check if columns have any missing value or if we still have other mistakes.
table(attacks$Date) #this one is fine
table(attacks$Year) #this one is fine
table(attacks$Type) #here we have boating and boat which mean the same thing. Let us group them
attacks$Type <- gsub("Boating", "Boat", attacks$Type)
attacks$Type <- gsub("Boatomg", "Boat", attacks$Type)
table(attacks$Type) #this one is fine
table(attacks$Country)#this one is fine
table(attacks$Activity)#this one is fine
table(attacks$Age)#this one is fine
table(attacks$Fatality)#this one is fine
table(attacks$Species)#this one is fine
#With the following code, we will add a new column which will associate each country with its ISO code. This is a very important step for the cleaning if the enxt sections, because il will allow us to match the different datasets.
# Get ISO country codes
iso_codes <- countrycode(attacks$Country, "country.name", "iso3c")
attacks$ISO_Code <- countrycode(attacks$Country, "country.name", "iso3c")
#Through the following function we can spot NAs. We can see that there are 23 countries with no ISO Code. We will fix them by hand.
rows_with_na <- which(is.na(attacks$ISO_Code))
attacks$ISO_Code[attacks$Country %in% c("ENGLAND", "SCOTLAND", "BRITISH ISLES") & is.na(attacks$ISO_Code)] <- "GB"
attacks$ISO_Code[attacks$Country %in% c("AZORES") & is.na(attacks$ISO_Code)] <- "PRT"
attacks$ISO_Code[attacks$Country %in% c("ST. MAARTIN", "ST. MARTIN") & is.na(attacks$ISO_Code)] <- "MAF"
attacks$ISO_Code[attacks$Country %in% c("OKINAWA") & is.na(attacks$ISO_Code)] <- "JPN"
attacks$ISO_Code[attacks$Country %in% c("MICRONESIA") & is.na(attacks$ISO_Code)] <- "FSM"
attacks <- subset(attacks, !is.na(ISO_Code))
DT::datatable(attacks, options = list(pageLength = 10))
<<<<<<< Updated upstream
=======
>>>>>>> Stashed changes
The main mission for this dataset was to change the entire structure of the dataset. Originally, the dataset had the years as columns (one column for each year), while the rows were represented by the different countries; the temperature observations were presented by month, quarter and year. Furthermore, for each country and each month/quarter/year, the dataset shows both the change in temperatures and the standard deviation.
To proceed with the cleaning, we eliminated the rows corresponding to the standard deviation (we will not work with it), and the monthly and quarterly observations: we are only interested in yearly temperature variations.
We then rotated the dataset so as to have all the years under a single column; we subsequently matched each country and year with its temperature change observation. Finally we added the ISO code to merge with the other datasets.
temperature <- read_xlsx(here::here("data/Temperature.xlsx"))
#R did not read the "°C" on the column Unit, so we changed it manually
temperature <- temperature %>% mutate(Unit = "°C")
#We take off all the columns that we do not need for our study
temperature <- temperature %>% select(-'Area Code (M49)', -'Months Code', -'Element Code')
#We wliminate all columns having an F at the end (For each year there are 2 columns (ex: 1961 - 1961F; 1962 - 1962F). The column with the F at the end of the Year is a column where only the letter E appears for each row. We tried to look for its meaning on the FAO website but we did not succeed. Therefore, we removed it. What is more, it is important to notice that it was in any case unimportant for our work).
columns_to_remove <- grep("F$", names(temperature), value = TRUE)
temperature <- temperature[, !(names(temperature) %in% columns_to_remove)]
# We keep only meteorological year, we don't need to work with monthly and seasonal data.
target_name <- "Meteorological year"
temperature <- temperature[temperature$Months == target_name, ]
target_name2 <- "Temperature change"
temperature <- temperature[temperature$Element == target_name2, ]
# We take off the rest of the columns that we do not need
temperature <- temperature %>% select(-'Area Code', -'Months', -'Element', -'Unit')
#The column of each year starts with an Y (Y1961, Y1961...). We take it off, so that we can easily match data with other datasets.
new_colnames <- gsub("Y", "", colnames(temperature))
colnames(temperature) <- new_colnames
temperature <- na.omit(temperature)
# We transpose the dataset so that columns become rows. With this function we have a single column, where each row corresponds to a year, but countries became the columns now. We will work on this later in the code.
temperature <- t(temperature)
# I want the first row to be the header
colnames(temperature) <- temperature[1, ]
clean_temperature <- temperature[-1, ]
#col names in CAPS
colnames(clean_temperature) <- toupper(colnames(clean_temperature))
library(tidyr)
# Convert the matrix/array "clean_temperatures" to a data frame
temperature <- as.data.frame(clean_temperature)
# Add 'Year' as a separate column
temperature$Year <- rownames(clean_temperature)
# Reshape the data from wide to long format
temperature <- tidyr::pivot_longer(temperature,
cols = -Year,
names_to = "Country",
values_to = "Temperature")
# Reorder columns as per your desired format
temperature <- temperature[c("Year", "Country", "Temperature")]
#As we said at the beginning, we only work with data from 1970
temperature <- temperature %>% filter(Year >= 1970)
temperature$Year <- as.numeric(temperature$Year)
temperature <- temperature %>%
mutate(Country = ifelse(Country == "UNITED STATES OF AMERICA", "USA", Country))
#As for the attacks dataset, we add a column for the ISO code.
temperature$ISO_Code <- countrycode(temperature$Country, "country.name", "iso3c")
temperature$ISO_Code[temperature$Country %in% c("MICRONESIA") & is.na(temperature$ISO_Code)] <- "FSM"
#Now that we put ISO code, we can remove the country column
temperature <- temperature %>% select(-'Country')
DT::datatable(temperature, options = list(pageLength = 10))
<<<<<<< Updated upstream
=======
>>>>>>> Stashed changes
We choose a data set about the sea level change observed during the last years. We start analyzing the sea level changes since 1993 because we did not find any data of years before. In this data set we had few columns but we only kept 2 columns that were the more relevant for us: the year and the GMSL_GIA. This last column is the global mean of sea level including the glacial isostatic adjustement (GIA). We choose the column with GIA and not without because this show the response of solid Earth to the past changes in surface loading by ice and water such as the glaciation and deglaciation.
sealevel <- read.csv(here::here("data/sealevel.csv"))
# keep the 2 columns we are interested to analyze because the other are irrelevant for our project
sealevel <- subset(sealevel, select = c(Year, GMSL_GIA))
#Show the year only the first time by creating a new column called Year2
sealevel$Year2 <- ifelse(duplicated(sealevel$Year), NA, sealevel$Year)
# delete column Year due to the creation of column Year 2
sealevel <- subset(sealevel, select = -Year)
# delete NA because it does not bring anything to our analysis
sealevel <- na.omit(sealevel)
#Change name of column Year
column <- gsub("2","",colnames(sealevel))
colnames(sealevel) <- column
DT::datatable(sealevel, options = list(pageLength = 10))
<<<<<<< Updated upstream
=======
>>>>>>> Stashed changes
Finally, the GHG emissions dataset was the easiest to clean. Indeed, we already have a column for the ISO code, so we simply had to filter the years so that the observations began in 1970.
co2 <- read.csv(here::here("data/CO2.csv"))
# Change names of columns in order to have the same columns that in the other datasets
names(co2)[names(co2) == "Entity"] <- "Country"
names(co2)[names(co2) == "Annual.CO..emissions"] <- "Annual CO2 Emissions"
names(co2)[names(co2) == "Code"] <- "ISO_Code"
# keep the 3 columns we are interested to analyze
co2 <- subset(co2, select = c("ISO_Code", "Year", "Annual CO2 Emissions"))
# only keep information starting in 1970 because we want to look at the evolution of
# the last 50 years
co2<- co2[co2$Year >= 1970, ]
co2$ISO_Code <- na_if(co2$ISO_Code, "")
co2 <- subset(co2, !is.na(ISO_Code))
DT::datatable(co2, options = list(pageLength = 10))
<<<<<<< Updated upstream
=======
>>>>>>> Stashed changes
After cleaning the four datasets, we have finally merged them. This step is essential to be able to run regression across variables that came from multiple datasets.
#MERGE FIRST TWO DATA SETS
# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data <- left_join(attacks, temperature, by = c("Year", "ISO_Code"))
#MERGE WITH THIRD
# Assuming 'Year' is the columns in both datasets for matching
merged_data2 <- left_join(merged_data, sealevel, by = c("Year"))
#MERGE WITH FOURTH
# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))
merged_data3$Temperature <- as.numeric(as.character(merged_data3$Temperature))
# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))
#Change name of a column
colnames(merged_data3)[colnames(merged_data3) == 'Country.x'] <- 'Country'
merged_data3$Temperature <- as.numeric(merged_data3$Temperature)
#for merged_data3, which we will use for our regressions, we need numerical variables
#therefore, we will make changes in some categorical ones
#TYPE
#Let's transform the Type variable on numeric. 1 corresponds to Boat, 2 to Unprovoked, Invalid to 3
# Provoked to 4, Questionable to 5 and Sea Disaster to 6
categories <- c("Boat" = 1, "Unprovoked" = 2, "Invalid" = 3, "Provoked" = 4, "Questionable" = 5, "Sea Disaster" = 6)
merged_data3$Type <- factor(merged_data3$Type, levels = names(categories))
merged_data3$Type <- as.numeric(merged_data3$Type)
#TIME
# Let's focus on the transformation of time where morning correspond to 0, afternoon to 1,
# evening to 2 and night to 3
merged_data3$Time <- as.numeric(factor(merged_data3$Time, levels = c("morning", "afternoon", "evening", "night")))
#SEX
merged_data3$Sex <- ifelse(merged_data3$Sex == "M", 0,
ifelse(merged_data3$Sex == "F", 1, 2))
#SPECIES
merged_data3 <- merged_data3 %>%
mutate(Species = as.numeric(factor(Species)))
# Get the top 5 species
top_species <- merged_data3 %>%
count(Species, sort = TRUE) %>%
slice_head(n = 5)
# Create a logical condition to include only the top 5 species
condition <- merged_data3$Species %in% top_species$Species
# Subset the data based on the condition
species_filtered_data <- merged_data3[condition, ]
DT::datatable(merged_data3, options = list(pageLength = 10))
<<<<<<< Updated upstream
#> # A tibble: 69 x 2
#> Country Attackscountry
#> <chr> <int>
#> 1 ARUBA 1
#> 2 AUSTRALIA 520
#> 3 BAHAMAS 77
#> 4 BRAZIL 96
#> 5 CANADA 1
#> 6 CHILE 1
#> 7 CHINA 4
#> 8 COLOMBIA 3
#> 9 COSTA RICA 5
#> 10 CROATIA 3
#> # i 59 more rows
After a deep Exploratory Data Analysis we thought about the relevance of some factors that might influence directly the occurrence of shark attacks. Directly in the sense that these are variables that were found in the main data set and not in a secondary one. Let’s begin then:
Before doing any regression, we will compute the correlation matrix in order to analyze our numerical variables
#> Year Type Sex Age Time Species
#> Year 1.00000 0.0280 -0.0546 0.09953 0.00641 -0.07148
#> Type 0.02797 1.0000 -0.1276 0.03761 0.03376 -0.10931
#> Sex -0.05457 -0.1276 1.0000 -0.04967 0.01828 0.01406
#> Age 0.09953 0.0376 -0.0497 1.00000 -0.10560 -0.00215
#> Time 0.00641 0.0338 0.0183 -0.10560 1.00000 0.00836
#> Species -0.07148 -0.1093 0.0141 -0.00215 0.00836 1.00000
The regression that we will use is the following one:
\[ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \epsilon \]
where:
Let’s do our main regression based on all the countries
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type +
#> Time, data = merged_map)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1208 -561 350 529 1219
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6116.783 1764.298 -3.47 0.00053 ***
#> Year 3.716 0.879 4.23 2.4e-05 ***
#> Sex -33.807 20.081 -1.68 0.09239 .
#> Age -9.556 0.855 -11.18 < 2e-16 ***
#> Species -2.017 1.330 -1.52 0.12949
#> Type -45.688 15.115 -3.02 0.00253 **
#> Time 6.008 16.467 0.36 0.71525
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 603 on 2894 degrees of freedom
#> Multiple R-squared: 0.0509, Adjusted R-squared: 0.049
#> F-statistic: 25.9 on 6 and 2894 DF, p-value: <2e-16
| Dependent variable: | |
| Attackscountry | |
| Year | 3.720*** |
| (0.879) | |
| Sex | -33.800* |
| (20.100) | |
| Age | -9.560*** |
| (0.855) | |
| Species | -2.020 |
| (1.330) | |
| Type | -45.700*** |
| (15.100) | |
| Time | 6.010 |
| (16.500) | |
| Constant | -6,117.000*** |
| (1,764.000) | |
| Observations | 2,901 |
| R2 | 0.051 |
| Adjusted R2 | 0.049 |
| Residual Std. Error | 603.000 (df = 2894) |
| F Statistic | 25.900*** (df = 6; 2894) |
| Note: | p<0.1; p<0.05; p<0.01 |
This first regression will be based only on the USA
#> [1] "Regression for USA:"
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type +
#> Time, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014306 -0.0000000000001 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000041
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001689386 0.0000000001644062
#> Year -0.0000000000001034 0.0000000000000819
#> Sex 0.0000000000012479 0.0000000000019505
#> Age 0.0000000000000542 0.0000000000000701
#> Species -0.0000000000000568 0.0000000000001135
#> Type 0.0000000000000894 0.0000000000016119
#> Time -0.0000000000002874 0.0000000000014769
#> t value Pr(>|t|)
#> (Intercept) 8996012496841.87 <0.0000000000000002 ***
#> Year -1.26 0.21
#> Sex 0.64 0.52
#> Age 0.77 0.44
#> Species -0.50 0.62
#> Type 0.06 0.96
#> Time -0.19 0.85
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1472 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.498
#> F-statistic: 245 on 6 and 1472 DF, p-value: <0.0000000000000002
#> Year Sex Age Species Type Time
#> 1.03 1.02 1.03 1.03 1.04 1.01
| Dependent variable: | |
| Attackscountry | |
| Year | -0.000 |
| (0.000) | |
| Sex | 0.000 |
| (0.000) | |
| Age | 0.000 |
| (0.000) | |
| Species | -0.000 |
| (0.000) | |
| Type | 0.000 |
| (0.000) | |
| Time | -0.000 |
| (0.000) | |
| Constant | 1,479.000*** |
| (0.000) | |
| Observations | 1,479 |
| R2 | 0.500 |
| Adjusted R2 | 0.498 |
| Residual Std. Error | 0.000 (df = 1472) |
| F Statistic | 245.000*** (df = 6; 1472) |
| Note: | p<0.1; p<0.05; p<0.01 |
#> Variable added: Year | AIC: -66827
#>
#> Call:
#> lm(formula = Attackscountry ~ Year, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014319 0.0000000000003 0.0000000000012 0.0000000000019
#> Max
#> 0.0000000000025
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001439275 0.0000000001614318
#> Year -0.0000000000000911 0.0000000000000807
#> t value Pr(>|t|)
#> (Intercept) 9161765526064.88 <0.0000000000000002 ***
#> Year -1.13 0.26
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1477 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.5
#> F-statistic: 1.48e+03 on 1 and 1477 DF, p-value: <0.0000000000000002
#>
#> Variable added: Sex | AIC: -66826
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014315 0.0000000000002 0.0000000000012 0.0000000000019
#> Max
#> 0.0000000000028
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001498393 0.0000000001618005
#> Year -0.0000000000000943 0.0000000000000809
#> Sex 0.0000000000011236 0.0000000000019330
#> t value Pr(>|t|)
#> (Intercept) 9140887108111.84 <0.0000000000000002 ***
#> Year -1.17 0.24
#> Sex 0.58 0.56
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1476 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.499
#> F-statistic: 738 on 2 and 1476 DF, p-value: <0.0000000000000002
#>
#> Variable added: Age | AIC: -66824
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014309 0.0000000000000 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000038
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001623448 0.0000000001625604
#> Year -0.0000000000001011 0.0000000000000813
#> Sex 0.0000000000012543 0.0000000000019402
#> Age 0.0000000000000553 0.0000000000000695
#> t value Pr(>|t|)
#> (Intercept) 9098156172803.24 <0.0000000000000002 ***
#> Year -1.24 0.21
#> Sex 0.65 0.52
#> Age 0.80 0.43
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1475 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.499
#> F-statistic: 492 on 3 and 1475 DF, p-value: <0.0000000000000002
#>
#> Variable added: Species | AIC: -66823
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014307 -0.0000000000001 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000041
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001696208 0.0000000001632457
#> Year -0.0000000000001039 0.0000000000000815
#> Sex 0.0000000000012481 0.0000000000019407
#> Age 0.0000000000000558 0.0000000000000696
#> Species -0.0000000000000577 0.0000000000001121
#> t value Pr(>|t|)
#> (Intercept) 9059961123428.15 <0.0000000000000002 ***
#> Year -1.27 0.20
#> Sex 0.64 0.52
#> Age 0.80 0.42
#> Species -0.51 0.61
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1474 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.499
#> F-statistic: 368 on 4 and 1474 DF, p-value: <0.0000000000000002
#>
#> Variable added: Time | AIC: -66821
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time,
#> data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014306 -0.0000000000001 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000041
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001700755 0.0000000001633070
#> Year -0.0000000000001038 0.0000000000000816
#> Sex 0.0000000000012382 0.0000000000019420
#> Age 0.0000000000000544 0.0000000000000700
#> Species -0.0000000000000577 0.0000000000001122
#> Time -0.0000000000002849 0.0000000000014757
#> t value Pr(>|t|)
#> (Intercept) 9056559928503.87 <0.0000000000000002 ***
#> Year -1.27 0.20
#> Sex 0.64 0.52
#> Age 0.78 0.44
#> Species -0.51 0.61
#> Time -0.19 0.85
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1473 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.498
#> F-statistic: 295 on 5 and 1473 DF, p-value: <0.0000000000000002
#>
#> Variable added: Type | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time +
#> Type, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014306 -0.0000000000001 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000041
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001689386 0.0000000001644062
#> Year -0.0000000000001034 0.0000000000000819
#> Sex 0.0000000000012479 0.0000000000019505
#> Age 0.0000000000000542 0.0000000000000701
#> Species -0.0000000000000568 0.0000000000001135
#> Time -0.0000000000002874 0.0000000000014769
#> Type 0.0000000000000894 0.0000000000016119
#> t value Pr(>|t|)
#> (Intercept) 8996012496841.87 <0.0000000000000002 ***
#> Year -1.26 0.21
#> Sex 0.64 0.52
#> Age 0.77 0.44
#> Species -0.50 0.62
#> Time -0.19 0.85
#> Type 0.06 0.96
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1472 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.498
#> F-statistic: 245 on 6 and 1472 DF, p-value: <0.0000000000000002
In order to have a better model by doing a forward selection where we will take the model with the lowest AIC.
In order to have a better model by doing a forward selection where we will take the model with the lowest AIC.
Here we have the AIC values for each model: 1. Model with Year only: AIC = -66827.49 2. Model with Year and Sex: AIC = -66825.83 3. Model with Year, Sex, and Age: AIC = -66824.46 4. Model with Year, Sex, Age, and Species: AIC = -66822.72 5. Model with Year, Sex, Age, Species and Time: AIC = -66820.77 6. Model with Year, Sex, Age, Species, Time and Type: AIC = -66818.77
As said previously, we chose the model with the lowest AIC indicating a better fit of the model, but we decided to choose it because the model includes more variables than the others, thus best explaining the variability in the data.
This second regression will be based only on Australia
#> [1] "Regression for AUSTRALIA:"
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type +
#> Time, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012389 0.00000000000007 0.00000000000024
#> 3Q Max
#> 0.00000000000051 0.00000000000062
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999993406163 0.000000000037820421
#> Year 0.000000000000000107 0.000000000000018791
#> Sex 0.000000000000195984 0.000000000000413708
#> Age 0.000000000000003260 0.000000000000020697
#> Species -0.000000000000004605 0.000000000000032269
#> Type 0.000000000000052540 0.000000000000292085
#> Time 0.000000000000332463 0.000000000000342080
#> t value Pr(>|t|)
#> (Intercept) 13749186062032.52 <0.0000000000000002 ***
#> Year 0.01 1.00
#> Sex 0.47 0.64
#> Age 0.16 0.87
#> Species -0.14 0.89
#> Type 0.18 0.86
#> Time 0.97 0.33
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000548 on 513 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.494
#> F-statistic: 85.5 on 6 and 513 DF, p-value: <0.0000000000000002
| Dependent variable: | |
| Attackscountry | |
| Year | 0.000 |
| (0.000) | |
| Sex | 0.000 |
| (0.000) | |
| Age | 0.000 |
| (0.000) | |
| Species | -0.000 |
| (0.000) | |
| Type | 0.000 |
| (0.000) | |
| Time | 0.000 |
| (0.000) | |
| Constant | 520.000*** |
| (0.000) | |
| Observations | 520 |
| R2 | 0.500 |
| Adjusted R2 | 0.494 |
| Residual Std. Error | 0.000 (df = 513) |
| F Statistic | 85.500*** (df = 6; 513) |
| Note: | p<0.1; p<0.05; p<0.01 |
#> Variable added: Year | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012420 0.00000000000023 0.00000000000025
#> 3Q Max
#> 0.00000000000026 0.00000000000027
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.99999999999772626 0.00000000003604512
#> Year -0.00000000000000173 0.00000000000001800
#> t value Pr(>|t|)
#> (Intercept) 14426365221367.6 <0.0000000000000002 ***
#> Year -0.1 0.92
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000546 on 518 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.499
#> F-statistic: 517 on 1 and 518 DF, p-value: <0.0000000000000002
#>
#> Variable added: Sex | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012415 0.00000000000027 0.00000000000029
#> 3Q Max
#> 0.00000000000029 0.00000000000030
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999995338840 0.000000000036493311
#> Year -0.000000000000000546 0.000000000000018213
#> Sex 0.000000000000176122 0.000000000000402815
#> t value Pr(>|t|)
#> (Intercept) 14249186641170.37 <0.0000000000000002 ***
#> Year -0.03 0.98
#> Sex 0.44 0.66
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000547 on 517 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.498
#> F-statistic: 259 on 2 and 517 DF, p-value: <0.0000000000000002
#>
#> Variable added: Age | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012415 0.00000000000027 0.00000000000029
#> 3Q Max
#> 0.00000000000029 0.00000000000030
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999995338840 0.000000000036790730
#> Year -0.000000000000000566 0.000000000000018402
#> Sex 0.000000000000176164 0.000000000000403241
#> Age 0.000000000000000161 0.000000000000020290
#> t value Pr(>|t|)
#> (Intercept) 14133995074177.91 <0.0000000000000002 ***
#> Year -0.03 0.98
#> Sex 0.44 0.66
#> Age 0.01 0.99
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000547 on 516 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.497
#> F-statistic: 172 on 3 and 516 DF, p-value: <0.0000000000000002
#>
#> Variable added: Species | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012414 0.00000000000021 0.00000000000030
#> 3Q Max
#> 0.00000000000031 0.00000000000034
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999996248334 0.000000000037075547
#> Year -0.000000000000000868 0.000000000000018490
#> Sex 0.000000000000180007 0.000000000000404139
#> Age 0.000000000000000389 0.000000000000020345
#> Species -0.000000000000005997 0.000000000000032010
#> t value Pr(>|t|)
#> (Intercept) 14025416977620.41 <0.0000000000000002 ***
#> Year -0.05 0.96
#> Sex 0.45 0.66
#> Age 0.02 0.98
#> Species -0.19 0.85
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000548 on 515 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.496
#> F-statistic: 129 on 4 and 515 DF, p-value: <0.0000000000000002
#>
#> Variable added: Time | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time,
#> data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012390 0.00000000000010 0.00000000000022
#> 3Q Max
#> 0.00000000000051 0.00000000000062
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999994656719 0.000000000037106145
#> Year -0.000000000000000473 0.000000000000018494
#> Sex 0.000000000000180390 0.000000000000404142
#> Age 0.000000000000003587 0.000000000000020597
#> Species -0.000000000000005281 0.000000000000032018
#> Time 0.000000000000338769 0.000000000000339958
#> t value Pr(>|t|)
#> (Intercept) 14013851309421.74 <0.0000000000000002 ***
#> Year -0.03 0.98
#> Sex 0.45 0.66
#> Age 0.17 0.86
#> Species -0.16 0.87
#> Time 1.00 0.32
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000548 on 514 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.495
#> F-statistic: 103 on 5 and 514 DF, p-value: <0.0000000000000002
#>
#> Variable added: Type | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time +
#> Type, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012389 0.00000000000007 0.00000000000024
#> 3Q Max
#> 0.00000000000051 0.00000000000062
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999993406163 0.000000000037820421
#> Year 0.000000000000000107 0.000000000000018791
#> Sex 0.000000000000195984 0.000000000000413708
#> Age 0.000000000000003260 0.000000000000020697
#> Species -0.000000000000004605 0.000000000000032269
#> Time 0.000000000000332463 0.000000000000342080
#> Type 0.000000000000052540 0.000000000000292085
#> t value Pr(>|t|)
#> (Intercept) 13749186062032.53 <0.0000000000000002 ***
#> Year 0.01 1.00
#> Sex 0.47 0.64
#> Age 0.16 0.87
#> Species -0.14 0.89
#> Time 0.97 0.33
#> Type 0.18 0.86
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000548 on 513 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.494
#> F-statistic: 85.5 on 6 and 513 DF, p-value: <0.0000000000000002
#> Year Sex Age Species Type Time
#> 1.08 1.07 1.06 1.03 1.09 1.04
Let’s focus on both countries
#> [1] "Regression for USA and AUSTRALIA:"
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Type +
#> Time, data = combined_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -917 -496 190 250 603
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3786.535 1512.587 2.50 0.01238 *
#> Year -1.123 0.753 -1.49 0.13616
#> Sex -23.274 17.490 -1.33 0.18343
#> Age -5.739 0.677 -8.48 < 0.0000000000000002 ***
#> Species -4.165 1.103 -3.77 0.00017 ***
#> Type -28.178 13.689 -2.06 0.03968 *
#> Time 27.132 13.682 1.98 0.04750 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 410 on 1992 degrees of freedom
#> Multiple R-squared: 0.0514, Adjusted R-squared: 0.0485
#> F-statistic: 18 on 6 and 1992 DF, p-value: <0.0000000000000002
| Dependent variable: | |
| Attackscountry | |
| Year | -1.120 |
| (0.753) | |
| Sex | -23.300 |
| (17.500) | |
| Age | -5.740*** |
| (0.677) | |
| Species | -4.170*** |
| (1.100) | |
| Type | -28.200** |
| (13.700) | |
| Time | 27.100** |
| (13.700) | |
| Constant | 3,787.000** |
| (1,513.000) | |
| Observations | 1,999 |
| R2 | 0.051 |
| Adjusted R2 | 0.049 |
| Residual Std. Error | 410.000 (df = 1992) |
| F Statistic | 18.000*** (df = 6; 1992) |
| Note: | p<0.1; p<0.05; p<0.01 |
#> Variable added: Year | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year, data = combined_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -759 -684 244 259 275
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4354.25 1520.51 2.86 0.0042 **
#> Year -1.56 0.76 -2.06 0.0400 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 420 on 1997 degrees of freedom
#> Multiple R-squared: 0.00211, Adjusted R-squared: 0.00161
#> F-statistic: 4.22 on 1 and 1997 DF, p-value: 0.04
#>
#> Variable added: Sex | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = combined_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -762 -682 242 261 291
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4357.61 1520.77 2.87 0.0042 **
#> Year -1.56 0.76 -2.06 0.0400 *
#> Sex -10.32 17.73 -0.58 0.5608
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 421 on 1996 degrees of freedom
#> Multiple R-squared: 0.00228, Adjusted R-squared: 0.00128
#> F-statistic: 2.28 on 2 and 1996 DF, p-value: 0.103
#>
#> Variable added: Age | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = combined_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -838 -509 187 242 614
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2938.621 1498.992 1.96 0.05 .
#> Year -0.771 0.750 -1.03 0.30
#> Sex -20.014 17.418 -1.15 0.25
#> Age -6.093 0.672 -9.06 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 412 on 1995 degrees of freedom
#> Multiple R-squared: 0.0417, Adjusted R-squared: 0.0403
#> F-statistic: 28.9 on 3 and 1995 DF, p-value: <0.0000000000000002
#>
#> Variable added: Species | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = combined_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -910 -501 194 245 623
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3444.180 1501.479 2.29 0.02190 *
#> Year -0.959 0.750 -1.28 0.20072
#> Sex -19.307 17.369 -1.11 0.26646
#> Age -5.997 0.671 -8.94 < 0.0000000000000002 ***
#> Species -3.877 1.095 -3.54 0.00041 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 411 on 1994 degrees of freedom
#> Multiple R-squared: 0.0477, Adjusted R-squared: 0.0458
#> F-statistic: 25 on 4 and 1994 DF, p-value: <0.0000000000000002
#>
#> Variable added: Time | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time,
#> data = combined_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -907 -504 192 249 615
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3380.836 1500.909 2.25 0.02440 *
#> Year -0.953 0.749 -1.27 0.20337
#> Sex -18.691 17.361 -1.08 0.28181
#> Age -5.839 0.676 -8.64 < 0.0000000000000002 ***
#> Species -3.858 1.094 -3.53 0.00043 ***
#> Time 25.669 13.675 1.88 0.06065 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 411 on 1993 degrees of freedom
#> Multiple R-squared: 0.0494, Adjusted R-squared: 0.047
#> F-statistic: 20.7 on 5 and 1993 DF, p-value: <0.0000000000000002
#>
#> Variable added: Type | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time +
#> Type, data = combined_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -917 -496 190 250 603
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3786.535 1512.587 2.50 0.01238 *
#> Year -1.123 0.753 -1.49 0.13616
#> Sex -23.274 17.490 -1.33 0.18343
#> Age -5.739 0.677 -8.48 < 0.0000000000000002 ***
#> Species -4.165 1.103 -3.77 0.00017 ***
#> Time 27.132 13.682 1.98 0.04750 *
#> Type -28.178 13.689 -2.06 0.03968 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 410 on 1992 degrees of freedom
#> Multiple R-squared: 0.0514, Adjusted R-squared: 0.0485
#> F-statistic: 18 on 6 and 1992 DF, p-value: <0.0000000000000002
We have done the following type of models of this first regression that will help us to answer this first research question: - Overall: all countries that we have on our merged dataset. - Focus on USA. - Focus on Australia. - Focus on the two countries above.
Each type of model according to their focus has its own primary factors influencing the occurrence of shark attacks. For example, in the first one (Overall), Age, Year, Type and Time are the factors that influence the attacks. But the two that are highly significant are Age and Year. In the 4th one, the one that is focused on both countries, the main factors are Age, Species, Type and Time. In conclusion, Age, Type and Time are highly significant predictors, thus they strongly influence the occurrence of shark attacks from a statistical point of view. We did not consider models focus on one country because they are not that relevant. We choose to analyze more than 1 country because it makes more sense. We analyzed the two countries with the most shark attacks because we had the most data about these countries. It gives us more details and the significative factors that influence the shark attacks. We reduce the complexity of our model because of the focus on the regions with the most attacks, so it makes our results more relevant.
=======With the y-axis showing the frequency of shark attacks and the x-axis showing the passage of time, the graph below displays a striking pattern in shark attacks over time. The data exhibits a recognizable pattern that indicates a steady increase in the quantity of shark attacks within the designated period. This rising tendency begs critical concerns regarding what is causing such an increase and necessitates more research into ecological, environmental, and human-related elements that could be involved in this occurrence. The graph emphasizes the need of keeping an eye on and comprehending the dynamics of these episodes in addition to providing a visual depiction of the rise in shark attacks.
This trend is a key research topic: why are marine attacks increasing year after year? What could be the reasons? In fact, the reasons can be various, for example of an ecological, environmental nature or linked to human seaside activity. The graph highlights the need to keep an eye on and understand the dynamics of these episodes, which is exactly what we will do during our study.
Three fundamental areas of this graph should be an element of analysis: 1975-1978; 1988-1990; 2019-2020. In fact, we could ask ourselves, why is there a decrease in attacks in this period? For the last period, the answer is simple: our dataset presents data up to June 2018, for the second half of the year we have no data, which explains the drastic drop in attacks. For the other two periods, we will discover the reasons during a more in-depth analysis
plot1 <- ggplot(data = attacks, aes(x = Year)) +
geom_smooth(stat = "count", aes(group = 1), color = "blue", size = 1) +
geom_bar(stat = "count", fill = "red") +
labs(title = "Shark Attacks Evolution Over Years", x = "Year", y = "Number of Attacks")
plot1
Pattern of shark attacks frequencies through the day
The bar graph below, where the x-axis indicates four time categories —morning, afternoon, evening, and night— displays the distribution of shark attacks throughout the day. The graph’s most noticeable aspect is how frequently shark attacks occur in the afternoon, followed by the morning, evening and night. This concentration of incidents in the afternoon raises intriguing questions about potential human behavioral factors that could contribute to this temporal pattern.
First and foremost, we believe that the afternoon is when most people choose to go swimming, which contributes to the frequency of shark attacks during this time. Recreational water activities such as swimming, surfing, kayaking, and other water sports are probably more popular during these times. Furthermore, the afternoon is when the seas are the warmest, so it’s intriguing to find out if this aspect also has a significant influence on shark activity. Conversely, the decreased number of assaults that occur at night may be related to less people using the water during that time, but there may be other possible reasons as well (less high temperatures, poorer visibility).
Lastly, we suppose that the medium frequencies of the morning and evening might correspond to transitional periods when shark and human behavior are changing.
bar_colors <- c("#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3")
# Create a sorted table
sorted_table <- table(attacks$Time)
sorted_table <- sorted_table[order(-sorted_table)]
# Create a bar plot with the sorted data
barplot1 <- barplot(sorted_table,
col = bar_colors,
xlab = "Time of Day", ylab = "Frequency of Attacks",
main = "Frequency of Shark Attacks by Time of Day",
border = "white",
ylim = c(0, 1800),
space = 0.5,
cex.names = 0.8,
font.axis = 2,
beside = TRUE)
<<<<<<< Updated upstream
model12 <- glm(Fatality ~ Date + Year + Age + Time,
data = filtered_data4,
family = "binomial")
| Dependent variable: | |
| Fatality | |
| Date | -0.006 |
| (0.026) | |
| Year | -0.030*** |
| (0.006) | |
| Age | 0.028*** |
| (0.006) | |
| Time | 0.086 |
| (0.124) | |
| Constant | 55.500*** |
| (12.300) | |
| Observations | 2,319 |
| Log Likelihood | -553.000 |
| Akaike Inf. Crit. | 1,115.000 |
| Note: | p<0.1; p<0.05; p<0.01 |
We will start by analiyzing the significant coefficient.
The intercept of 55.643 is the estimated log-odds of fatality when all predictors are zero. Since, our variable of interest is a binomal one, it can be difficult to interpret this log-odds directly, though, and it might not be clear how it affects the likelihood of death. This value needs a deepen econometric study in order to be comprehended.
The coefficient for ‘Year’ is -0.030, suggesting that the likelihood of a fatal outcome in shark attacks decreases by around 0.0324.as the year increases. This trend may be caused by developments in emergency medical response and care, public education and awareness campaigns about shark safety, enhanced beach surveillance and warning systems, or adjustments in recreational activities and behavior near coastal areas.
The ‘Age’ coefficient is 0.027, indicating that for each unit increase in age, the log-odds of fatality increase by 0.0269. One explanation for this correlation may be that older people have less physical strength and agility, which makes it harder for them to flee from or defend against a shark attack. Additionally, older people may be more vulnerable to severe injuries from shark bites. Furthermore, the observed higher risk of death in older age groups may be explained by behavioral factors such as an increased propensity to participate in riskier water activities or an increased amount of time spent in the water.
The last two coefficients, that of Date and Time are not significant, but let is dive into them!
The ‘Date’ coefficient is -0.006, but with a p-value of 0.85, it is not statistically significant, implying that the month of the shark attack may not be a strong predictor of fatality. But this was not a surprise. Indeed, the three countries we took into consideration have seasons that vary throughout the year, based on their position on the globe. For example, Summer in USA goes from June to September, but in Australia and South Africa it goes from December to February/March. This might create a disequilibrium when R reads them.
Remarkably, the ‘Time’ coefficient is 0.017, indicating that the time of day may not be a predictive factor for shark attack fatalities. The lack of significance of ‘Time’ may suggest that the time of an attack by a shark has little bearing on whether it ends fatally. Intuitively, we thought that night attacks would be more fatal, given the difficulty in asking or receiving help, but apparently this is not the case.
The model’s goodness of fit is assessed through the null and residual deviances, with a lower residual deviance of 1079 on 2190 degrees of freedom compared to the null deviance of 1120.4 on 2194 degrees of freedom. This reduction in deviance indicates that the model explains some of the variability in the data. The Akaike Information Criterion (AIC) of 1089 is a measure of the model’s relative quality, considering both fit and complexity.
======= # Add text labels with frequencies on the bars text_labels <- sorted_table text(x = barplot1, y = sorted_table, labels = text_labels, pos = 3, cex = 0.8, col = "black")Pattern of shark attacks frequencies per country
A first glance at our plotted data reveals that the United States, Australia, and South Africa are the three primary countries with a significant number of shark attacks.
shark_attacks_by_country <- attacks %>%
group_by(Country) %>%
summarise(total_attacks = n())
shark_attacks_by_country <- shark_attacks_by_country %>%
mutate(CountryCategory = ifelse(total_attacks >= 30, as.character(Country), "Other"))
# Order the countries by frequency in descending order
shark_attacks_by_country$CountryCategory <- reorder(shark_attacks_by_country$CountryCategory, -shark_attacks_by_country$total_attacks)
# Plot the data
plot2<- ggplot(data = shark_attacks_by_country, aes(x = total_attacks, y = CountryCategory)) +
geom_col(fill = "skyblue") +
labs(title = "Total Shark Attacks in Each Country", x = "Number of Attacks", y = "Country") +
theme_minimal() +
theme(axis.text.y = element_text(hjust = 1)) +
scale_y_discrete(labels = scales::label_wrap(10))
interactive_plot <- ggplotly(plot2)
interactive_plot
>>>>>>> Stashed changes
Are shark attacks more likely to lead to death?
IT WAS A PIECHART…
Fatalities by type of shark
As we can see in this graph, most of attacks were not fatal, meaning that sharks injured a lot of people without taking their lives away. The types of sharks that did the most of accidents were the following ones: Bull shark, Tiger shark and White shark, with the deadlies being the last one. Why do they commit attacks? Probably because their feeding behavior. Maybe they misidentify their predators attacking people instead of animals like turtles, seals, or sea lions. This is consistent with Bull shark and Tiger shark being repetitevly named as the deadlies sharks in the seas and oeceans (CBS News, 2010). Another relevant information derived from this graph is the big number of unidentified sharks. Out of all the attacks recorded by our dataset, we have almost 600 Unidentified species. It is such a pity not to be able to have this information, which would have led us to more insightful results.